In this project, I will be investigating some of the data gotten from the Google PlayStore analytics. This data is provided by companies like Sensor Tower or App Annie, which helps to drive development and App marketing strategies to succeed for many companies. It's really a BIG THINGY!
The App and review data was scrapped from the Google PlayStore by Lavanya Gupta in 2018. The original files are listed here.
# import all packages and libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
%matplotlib inline
# Show numeric output in decimal format (e.g. 2.15)
pd.options.display.float_format = '{:,.2f}'.format
# load data
df_gapps = pd.read_csv("googleapps.csv")
df_gapps.head()
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Android_Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ak Parti Yardım Toplama | SOCIAL | NaN | 0 | 8.70 | 0 | Paid | $13.99 | Teen | Social | July 28, 2017 | 4.1 and up |
| 1 | Ain Arabic Kids Alif Ba ta | FAMILY | NaN | 0 | 33.00 | 0 | Paid | $2.99 | Everyone | Education | April 15, 2016 | 3.0 and up |
| 2 | Popsicle Launcher for Android P 9.0 launcher | PERSONALIZATION | NaN | 0 | 5.50 | 0 | Paid | $1.49 | Everyone | Personalization | July 11, 2018 | 4.2 and up |
| 3 | Command & Conquer: Rivals | FAMILY | NaN | 0 | 19.00 | 0 | NaN | 0 | Everyone 10+ | Strategy | June 28, 2018 | Varies with device |
| 4 | CX Network | BUSINESS | NaN | 0 | 10.00 | 0 | Free | 0 | Everyone | Business | August 6, 2018 | 4.1 and up |
# structure and information of the dataset
print(df_gapps.shape)
print(df_gapps.columns)
print(df_gapps.info())
(10841, 12)
Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null int64
4 Size_MBs 10841 non-null float64
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content_Rating 10841 non-null object
9 Genres 10841 non-null object
10 Last_Updated 10841 non-null object
11 Android_Ver 10839 non-null object
dtypes: float64(2), int64(1), object(9)
memory usage: 1016.5+ KB
None
# remove unused columns "Android_Ver" and "Last_Updated"
df_gapps = df_gapps.drop(columns=["Last_Updated", "Android_Ver"], axis=1)
df_gapps.head()
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ak Parti Yardım Toplama | SOCIAL | NaN | 0 | 8.70 | 0 | Paid | $13.99 | Teen | Social |
| 1 | Ain Arabic Kids Alif Ba ta | FAMILY | NaN | 0 | 33.00 | 0 | Paid | $2.99 | Everyone | Education |
| 2 | Popsicle Launcher for Android P 9.0 launcher | PERSONALIZATION | NaN | 0 | 5.50 | 0 | Paid | $1.49 | Everyone | Personalization |
| 3 | Command & Conquer: Rivals | FAMILY | NaN | 0 | 19.00 | 0 | NaN | 0 | Everyone 10+ | Strategy |
| 4 | CX Network | BUSINESS | NaN | 0 | 10.00 | 0 | Free | 0 | Everyone | Business |
# check for NaN values especially in 'Rating' column
df_gapps['Rating'].isna().sum()
1474
# drop the NaN values
df_gapps_clean = df_gapps.dropna()
df_gapps_clean.shape
(9367, 10)
# check for duplicate records
df_gapps_clean.duplicated().values.any()
True
# check for duplicates especially the 'Instagram' App
df_gapps_clean[df_gapps_clean['App'] == "Instagram"]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 10806 | SOCIAL | 4.50 | 66577313 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social | |
| 10808 | SOCIAL | 4.50 | 66577446 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social | |
| 10809 | SOCIAL | 4.50 | 66577313 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social | |
| 10810 | SOCIAL | 4.50 | 66509917 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social |
# to completely remove duplicates, I need to explicitly specified the subsets
df_gapps_clean = df_gapps_clean.drop_duplicates(subset=["App", 'Type', 'Price'])
# check the Instagram App
df_gapps_clean[df_gapps_clean['App'] == 'Instagram']
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 10806 | SOCIAL | 4.50 | 66577313 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social |
# type conversion
# remove "," form Installs values
df_gapps_clean.Installs = df_gapps_clean["Installs"].astype(str).str.replace(",", "")
# convert to numeric
df_gapps_clean.Installs = pd.to_numeric(df_gapps_clean["Installs"])
# remove the "$" sign
df_gapps_clean['Price'] = df_gapps_clean['Price'].astype(str).str.replace("$", "")
# convert to numeric
df_gapps_clean['Price'] = pd.to_numeric(df_gapps_clean['Price'])
C:\Users\Moyinoluwa Busola\AppData\Local\Temp\ipykernel_13348\1423137204.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df_gapps_clean['Price'] = df_gapps_clean['Price'].astype(str).str.replace("$", "")
# check for the most expensive Apps
df_gapps_clean[df_gapps_clean["Price"]> 250]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1331 | most expensive app (H) | FAMILY | 4.30 | 6 | 1.50 | 100 | Paid | 399.99 | Everyone | Entertainment |
| 1946 | I am rich (Most expensive app) | FINANCE | 4.10 | 129 | 2.70 | 1000 | Paid | 399.99 | Teen | Finance |
| 2193 | I am extremely Rich | LIFESTYLE | 2.90 | 41 | 2.90 | 1000 | Paid | 379.99 | Everyone | Lifestyle |
| 2394 | I am Rich! | FINANCE | 3.80 | 93 | 22.00 | 1000 | Paid | 399.99 | Everyone | Finance |
| 2461 | I AM RICH PRO PLUS | FINANCE | 4.00 | 36 | 41.00 | 1000 | Paid | 399.99 | Everyone | Finance |
| 2775 | I Am Rich Pro | FAMILY | 4.40 | 201 | 2.70 | 5000 | Paid | 399.99 | Everyone | Entertainment |
| 3114 | I am Rich | FINANCE | 4.30 | 180 | 3.80 | 5000 | Paid | 399.99 | Everyone | Finance |
| 3145 | I am rich(premium) | FINANCE | 3.50 | 472 | 0.94 | 5000 | Paid | 399.99 | Everyone | Finance |
| 3221 | I am Rich Plus | FAMILY | 4.00 | 856 | 8.70 | 10000 | Paid | 399.99 | Everyone | Entertainment |
| 3554 | 💎 I'm rich | LIFESTYLE | 3.80 | 718 | 26.00 | 10000 | Paid | 399.99 | Everyone | Lifestyle |
| 3856 | I am rich VIP | LIFESTYLE | 3.80 | 411 | 2.60 | 10000 | Paid | 299.99 | Everyone | Lifestyle |
| 3897 | I Am Rich | FAMILY | 3.60 | 217 | 4.90 | 10000 | Paid | 389.99 | Everyone | Entertainment |
| 3946 | I'm Rich - Trump Edition | LIFESTYLE | 3.60 | 275 | 7.30 | 10000 | Paid | 400.00 | Everyone | Lifestyle |
| 4606 | I Am Rich Premium | FINANCE | 4.10 | 1867 | 4.70 | 50000 | Paid | 399.99 | Everyone | Finance |
| 5765 | I am rich | LIFESTYLE | 3.80 | 3547 | 1.80 | 100000 | Paid | 399.99 | Everyone | Lifestyle |
# remove Apps which price are greater than 250
df_gapps_clean = df_gapps_clean[df_gapps_clean['Price'] < 250 ]
df_gapps_clean.sort_values('Price', ascending=False).head()
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2281 | Vargo Anesthesia Mega App | MEDICAL | 4.60 | 92 | 32.00 | 1000 | Paid | 79.99 | Everyone | Medical |
| 1407 | LTC AS Legal | MEDICAL | 4.00 | 6 | 1.30 | 100 | Paid | 39.99 | Everyone | Medical |
| 2629 | I am Rich Person | LIFESTYLE | 4.20 | 134 | 1.80 | 1000 | Paid | 37.99 | Everyone | Lifestyle |
| 2481 | A Manual of Acupuncture | MEDICAL | 3.50 | 214 | 68.00 | 1000 | Paid | 33.99 | Everyone | Medical |
| 2463 | PTA Content Master | MEDICAL | 4.20 | 64 | 41.00 | 1000 | Paid | 29.99 | Everyone | Medical |
# create a column called 'Revenue Estimate'
df_gapps_clean["Revenue_Estimate"] = df_gapps_clean['Price'].mul(df_gapps_clean['Installs'])
df_gapps_clean.sort_values("Revenue_Estimate", ascending=False).head(10)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Revenue_Estimate | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9220 | Minecraft | FAMILY | 4.50 | 2376564 | 19.00 | 10000000 | Paid | 6.99 | Everyone 10+ | Arcade;Action & Adventure | 69,900,000.00 |
| 8825 | Hitman Sniper | GAME | 4.60 | 408292 | 29.00 | 10000000 | Paid | 0.99 | Mature 17+ | Action | 9,900,000.00 |
| 7151 | Grand Theft Auto: San Andreas | GAME | 4.40 | 348962 | 26.00 | 1000000 | Paid | 6.99 | Mature 17+ | Action | 6,990,000.00 |
| 7477 | Facetune - For Free | PHOTOGRAPHY | 4.40 | 49553 | 48.00 | 1000000 | Paid | 5.99 | Everyone | Photography | 5,990,000.00 |
| 7977 | Sleep as Android Unlock | LIFESTYLE | 4.50 | 23966 | 0.85 | 1000000 | Paid | 5.99 | Everyone | Lifestyle | 5,990,000.00 |
| 6594 | DraStic DS Emulator | GAME | 4.60 | 87766 | 12.00 | 1000000 | Paid | 4.99 | Everyone | Action | 4,990,000.00 |
| 6082 | Weather Live | WEATHER | 4.50 | 76593 | 4.75 | 500000 | Paid | 5.99 | Everyone | Weather | 2,995,000.00 |
| 7954 | Bloons TD 5 | FAMILY | 4.60 | 190086 | 94.00 | 1000000 | Paid | 2.99 | Everyone | Strategy | 2,990,000.00 |
| 7633 | Five Nights at Freddy's | GAME | 4.60 | 100805 | 50.00 | 1000000 | Paid | 2.99 | Teen | Action | 2,990,000.00 |
| 6746 | Card Wars - Adventure Time | FAMILY | 4.30 | 129603 | 23.00 | 1000000 | Paid | 2.99 | Everyone 10+ | Card;Action & Adventure | 2,990,000.00 |
# Issue: having multiple Genres separated by ;
df_gapps_clean.Genres.value_counts().sort_values(ascending=True)[:5]
Lifestyle;Pretend Play 1 Strategy;Education 1 Adventure;Education 1 Role Playing;Brain Games 1 Tools;Education 1 Name: Genres, dtype: int64
# split the string on ; and then .stack them
stack = df_gapps_clean.Genres.str.split(";", expand=True).stack()
print(f"We now have a single column shape: {stack.shape}")
We now have a single column shape: (8564,)
# number of Genres
numb_genres = stack.value_counts()
print(f"We have: {len(numb_genres)} genres.")
We have: 53 genres.
print(df_gapps_clean.shape)
df_gapps_clean.head()
(8184, 11)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Revenue_Estimate | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | KBA-EZ Health Guide | MEDICAL | 5.00 | 4 | 25.00 | 1 | Free | 0.00 | Everyone | Medical | 0.00 |
| 28 | Ra Ga Ba | GAME | 5.00 | 2 | 20.00 | 1 | Paid | 1.49 | Everyone | Arcade | 1.49 |
| 47 | Mu.F.O. | GAME | 5.00 | 2 | 16.00 | 1 | Paid | 0.99 | Everyone | Arcade | 0.99 |
| 82 | Brick Breaker BR | GAME | 5.00 | 7 | 19.00 | 5 | Free | 0.00 | Everyone | Arcade | 0.00 |
| 99 | Anatomy & Physiology Vocabulary Exam Review App | MEDICAL | 5.00 | 1 | 4.60 | 5 | Free | 0.00 | Everyone | Medical | 0.00 |
The cleaned dataset now have 8184 rows, 11 columns.
All the features in the dataset will helpful especially those of Price, Type, App, Rating,Category, Installs and Revenue Estimates columns.
Which content have the highest ratings?
content_ratings = df_gapps_clean['Content_Rating'].value_counts()
# plot a pie chart
plt.figure(figsize=(10, 8))
figr = px.pie(labels=content_ratings.index,
values=content_ratings.values,
title="Content Rating",
names=content_ratings.index);
C:\Users\Moyinoluwa Busola\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead. return args["labels"][column]
<Figure size 720x576 with 0 Axes>
# update traces
figr.update_traces(textposition="outside", textinfo='percent + label');
figr.show()
What are the top ten (10) category apps?
top_10_categories = df_gapps_clean['Category'].value_counts()
# plot bar chart
plt.figure(figsize=(12, 10))
plt.xticks(rotation=30)
plt.xlabel("Top 10 Categories")
plt.ylabel("Count")
plt.title("Top 10 Categories for Apps")
fig = plt.bar(x=top_10_categories.index[:10], height=top_10_categories.values[:10]);
plt.show()
What are the top ten (10) genres?
top_10_genres = df_gapps_clean['Genres'].value_counts()
# plot horizontal bar chart
plt.figure(figsize=(10, 8))
plt.yticks(rotation=30)
plt.xlabel("Count")
plt.ylabel("Genres")
plt.title("Top 10 Genres")
plt.barh(top_10_genres[:10].index, top_10_genres[:10].values);
plt.show()
What is the distribution of Size in MBs?
# plot histogram
bins = np.arange(0, df_gapps_clean['Size_MBs'].max()+5, 5)
plt.figure(figsize=(8, 6))
plt.xlabel("Size MBs")
plt.ylabel("Count")
plt.hist(df_gapps_clean["Size_MBs"], bins=bins);
plt.show()
Which are the most downloaded apps? Are they free or paid?
paid_vs_free_apps = df_gapps_clean['Type'].value_counts()
# plotting a donut chart
plt.figure(figsize=(10, 8))
fig = px.pie(labels=paid_vs_free_apps.index,
values=paid_vs_free_apps.values,
hole=0.6,
title="Paid Vs Free Apps",
names=paid_vs_free_apps.index);
# update traces
fig.update_traces(textposition="outside", textinfo="percent + label");
fig.show()
C:\Users\Moyinoluwa Busola\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
<Figure size 720x576 with 0 Axes>
How are Ratings distributed?
# plot histogram
sb.displot(df_gapps_clean['Rating'], height=4.5);
How is the 'Size_MBs' distributed?
# plot histogram
sb.displot(df_gapps_clean["Size_MBs"], kind="hist", height=6);
Most variables are either left skewed or right skewed. There were unusual points in the data which I detected. I performed different transformations on the data in other to get good visualizations.
Yes, I did some operations to adjust or change the form of data in order to capture the main features of focus. The processes helped me to make better observations by using the correct data format. There have been an unusual distributions on the data.
How does the correlation between 'Installs' and number of Apps (Categorie) look like?
# here, I will have to create a dataframe to show this relationship, then groupby 'Category' and sum the number of 'Installs'
category_by_installs = df_gapps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
# groupby Category and count the number of Apps
category_by_num = df_gapps_clean.groupby('Category').agg({'App': pd.Series.count})
# new dataframe
df_appcombined = pd.merge(category_by_num, category_by_installs, on='Category', how='inner')
df_appcombined.head()
| App | Installs | |
|---|---|---|
| Category | ||
| ART_AND_DESIGN | 61 | 114233100 |
| AUTO_AND_VEHICLES | 73 | 53129800 |
| BEAUTY | 42 | 26916200 |
| BOOKS_AND_REFERENCE | 169 | 1665791655 |
| BUSINESS | 262 | 692018120 |
# plot a scatter plot
fig = px.scatter(df_appcombined, # data
x='App', # names
y='Installs',
size="App",
hover_name=df_appcombined.index,
color="Installs",
title="Category Correlation")
fig.update_layout(xaxis_title="Number Of Apps",
yaxis_title="Installs",
yaxis=dict(type="log"))
fig.show()
How many Installs are paid compared to the free installs?
# plot box plot
fig = px.box(df_gapps_clean,
x="Type",
y= "Installs",
color = "Type",
notched=True,
title="How many Installs are paid compared to free?",
)
fig.update_layout(xaxis_title="Type of Install",
yaxis_title="Installs",
yaxis=dict(type=("log")))
fig.show()
How much can 'Paid' apps earn?
# plot box plot
paid_gapps = df_gapps_clean[df_gapps_clean["Type"] == "Paid"]
box = px.box(paid_gapps,
x="Category",
y= "Revenue_Estimate",
title="How much can Paid Apps earn per Category?"
)
box.update_layout(xaxis_title="Category",
yaxis_title="Paid App BallPark Revenue",
xaxis= {'categoryorder': 'min ascending'},
yaxis= dict(type=("log")))
box.show()
Which type of install have majority count in 'Content_Rating'?
# plot bar chart
def plot():
plt.figure(figsize=(5,5))
plt.xticks(rotation=45);
plt.xlabel("Content Rating")
plt.ylabel("Count")
sb.countplot(data=df_gapps_clean, x="Content_Rating", hue="Type");
plot()
A very positive relationship occured within the features. Take for instance, the relationship between Type and Installs. Most Installs are free compared to paid, and also the relationship between Category and Revenue Estimate, especially on paid apps was very varying.
Yes, of course. For example, I thought paid apps would not have a high revenue, but I found out in the family category that, paid installs made a whooping revenue lot of $69.9 Milion.
What is the relationship between App, Category and Type?
# free vs paid apps
free_vs_paid = df_gapps_clean.groupby(["Category", "Type"], as_index=False).agg({"App": pd.Series.count})
# plot a clustured bar chart
plt.figure(figsize=(16, 10))
plt.xlabel("Category")
plt.ylabel("App")
plt.title("Relationship between App, Category and Type.")
plt.xticks(rotation=90)
sb.barplot(data=free_vs_paid, x="Category", y="App", hue="Type");
plt.show()
For most the categorieFrom the above relationship, most of the apps are Free compared to Paid apps.
What is the relationship between Installs, Rating and Price?
# plot
plt.figure(figsize=(10, 7))
plt.xlabel("Rating")
plt.ylabel("Price")
plt.scatter(data=df_gapps_clean, x="Rating", y="Price", c="Installs");
plt.colorbar(label="Installs");
What is the relationship between Installs, Content_Rating and Rating?
# plot
fig = sb.FacetGrid(data=df_gapps_clean, col="Content_Rating");
fig.map(plt.scatter, "Rating", "Installs", alpha=1);
What could a relationship between 6 numeric varibles look like?
# plot the relationship between six (6) numeric variables
apps_stats = ["Rating", "Reviews", "Size_MBs", "Installs", "Price", "Revenue_Estimate"]
def plot_matron():
fig = sb.PairGrid(data=df_gapps_clean, vars=apps_stats)
fig.map_diag(plt.hist);
fig.map_offdiag(plt.scatter);
plot_matron()
How can the relationship between the 6 numeric variables above be clearly shown?
# plot the relationship between the 6 numeric variables above
plt.figure(figsize=(12, 10))
plt.title("A Relationship Between Six (6) Variables")
sb.heatmap(df_gapps_clean[apps_stats].corr(), cmap="viridis_r", vmin=0, annot=True);
plt.show()
Most of the relationships were positive. Yes, there were features that strengthened each other. An instance is when three or more variables are combined, they strengthen the visuals and the observations, hence making better conclusions.
Yes, there were interesting interactions between features. For example, when visualizing the relationship between Installs, Rating and Price, most Installs and Ratings were on free ($0) apps.
- Most of the app Installs from the dataset are free apps compared to paid apps. Although paid apps accumulated a reasonable amount of revenue, paid installs turned out to be largely lower compared to free apps installs.
- On the part of content, 'Everyone' content had a large number of installs compared to other contents.
- Finally, most apps with high ratings and reviews are free, therefore making app development a very exiting field to analyse.
- Assesing
- Cleaning
- Exploring
- Analysing
- Sharing
After loading my dataset, I thoroughly assessed the data and figured out the main features that I will need to explore and analyse, so as to arrive at better conclusions. All the steps I took are well documented.